I. Introduction

Automated video analysis is important for many vision applications such as surveillance,
traffic monitoring, augmented reality, vehicle navigation, etc. [1], [2]. As pointed out in [1],
there are three key steps for automated video analysis: object detection, object tracking and
behavior recognition. As the first step, object detection aims to locate and segment interesting
objects in a video. Then, such objects can be tracked from frame to frame, and the tracks can
be analyzed to recognize object behavior. Thus, object detection plays a critical role in practical
applications.

Object detection is usually achieved by object detectors or background subtraction [1]. An
object detector is often a classifier that scans the image by a sliding window and labels each
subimage defined by the window as either object or background. Generally, the classifier is built
by offline learning on separate datasets [3], [4] or by online learning initialized with a manually
labeled frame at the start of a video [5], [6]. Alternatively, background subtraction [7] compares
images with a background model and detects the changes as objects. It usually assumes that no
object appears in images when building the background model [2]. Such requirements of
training examples for object or background modeling limit the applicability of the
above-mentioned methods in automated video analysis.
Fig. 1. Two examples to illustrate the problem. (a) A sequence of 40 frames, where a walking lady is recorded by
a hand-held camera. From left to right are the 1st, 20th and 40th frames. (b) A sequence of 48 frames clipped from
a surveillance video at the airport. From left to right are the 1st, 24th and 48th frames. Notice that the escalator is
moving. The objective is to segment the moving people automatically without extra inputs.

Another category of object detection methods that can avoid training phases are motion-based
methods [8], [9], which only use motion information to separate objects from the background.
The problem can be rephrased as follows. Given a sequence of images in which foreground
objects are present and moving differently from the background, can we separate the objects
from the background automatically? Fig. 1(a) shows such an example, where a walking lady is
always present and recorded by a hand-held camera. The goal is to take the image sequence as
input and directly output a mask sequence of the walking lady.

The most natural way for motion-based object detection is to classify pixels according to
motion patterns, which is usually named motion segmentation [9], [10]. These approaches achieve
both segmentation and optical flow computation accurately and they can work in the presence of
large camera motion. However, they assume rigid motion [9] or smooth motion [10] in respective
regions, which is not generally true in practice. The foreground motion can be very
complicated with nonrigid shape changes. Also, the background may be complex, including
illumination changes and varying textures such as waving trees and sea waves. Fig. 1(b) shows
such a challenging example. The video includes an operating escalator, but it should be regarded
as background for the purpose of human tracking. An alternative motion-based approach is background
estimation [11], [12]. Different from background subtraction, it estimates a background model
directly from the testing sequence. Generally, it seeks temporal intervals inside which the
pixel intensity is unchanged and uses image data from such intervals for background estimation.
However, this approach also relies on the assumption of a static background. Hence, it has difficulty
handling scenarios with complex backgrounds or moving cameras.
In this paper, we propose a novel algorithm for moving object detection, which falls into
the category of motion-based methods. It solves the challenges mentioned above in a unified
framework named DEtecting Contiguous Outliers in the LOw-rank Representation (DECOLOR).
We assume that the underlying background images are linearly correlated. Thus, the matrix
composed of vectorized video frames can be approximated by a low-rank matrix, and the moving
objects can be detected as outliers in this low-rank representation. Formulating the problem as
outlier detection allows us to get rid of many assumptions on the behavior of the foreground. The
low-rank representation of the background makes it flexible to accommodate the global variations
in the background. Moreover, DECOLOR performs object detection and background estimation
simultaneously without training sequences. The main contributions can be summarized as
follows:

1. We propose a new formulation of outlier detection in the low-rank representation, in which
the outlier support and the low-rank matrix are estimated simultaneously. We establish the
link between our model and other relevant models in the framework of Robust Principal
Component Analysis (RPCA) [13]. Different from other formulations of RPCA, we model
the outlier support explicitly. DECOLOR can be interpreted as $\ell_0$-penalty regularized RPCA,
which is a more faithful model for the problem of moving object segmentation. Following the
novel formulation, an effective and efficient algorithm is developed to solve the problem. We
demonstrate that, although the energy is non-convex, DECOLOR achieves better accuracy
in terms of both object detection and background estimation compared against the
state-of-the-art algorithm of RPCA [13].

2. In other models of RPCA, no prior knowledge on the spatial distribution of outliers has
been considered. In real videos, the foreground objects usually form small clusters. Thus,
the detection of contiguous regions should be preferred. Since the outlier support is modeled
explicitly in our formulation, we can naturally incorporate such a contiguity prior using
Markov Random Fields (MRFs) [14].

3. We use a parametric motion model to compensate for camera motion. The compensation of 
camera motion is integrated into our unified framework and computed in a batch manner 
for all frames during segmentation and background estimation. 

The MATLAB implementation of DECOLOR, experimental data and more results are publicly 
available at: http://bioinformatics.ust.hk/decolor/decolor.html.

II. Related Work 

The literature on object detection is vast, including object detectors (supervised learning),
image segmentation, background subtraction, etc. [1]. Our method aims to segment objects based
on motion information and it comprises a component of background modeling. Thus, motion
segmentation and background subtraction are the topics most related to this paper.

A. Motion Segmentation 

In motion segmentation, the moving objects are continuously present in the scene, and the 
background may also move due to camera motion. The target is to separate different motions. 

A common approach for motion segmentation is to partition the dense optical-flow field [15].
This is usually achieved by decomposing the image into different motion layers [16], [17], [10].
The assumption is that the optical-flow field should be smooth in each motion layer, and sharp
motion changes only occur at layer boundaries. Dense optical flow and motion boundaries are
computed in an alternating manner named motion competition [10], which is usually implemented
in a level set framework. A similar scheme was later applied to dynamic texture segmentation
[18], [19], [20]. While high accuracy can be achieved by these methods, accurate motion analysis
itself is a challenging task due to the difficulties raised by the aperture problem, occlusion, video
noise, etc. [21]. Moreover, most of the motion segmentation methods require object contours
to be initialized and the number of foreground objects to be specified [10].

An alternative approach for motion segmentation tries to segment the objects by analyzing
point trajectories [9], [22], [23], [24]. Some sparse feature points are first detected and tracked
throughout the video and then separated into several clusters via subspace clustering [25] or
spectral clustering [24]. The formulation is mathematically elegant and it can handle large camera
motion. However, these methods require point trajectories as input and only output a segmentation
of sparse points. The performance relies on the quality of point tracking, and postprocessing is
needed to obtain the dense segmentation [26]. Also, they are limited when dealing with noisy
data and nonrigid motion [25].
B. Background Subtraction

In background subtraction, the general assumption is that a background model can be obtained
from a training sequence that does not contain foreground objects. Moreover, it usually assumes
that the video is captured by a static camera [7]. Thus, foreground objects can be detected by
checking the difference between the testing frame and the background model built previously.

A considerable amount of work has been done on background modeling, i.e. building
a proper representation of the background scene. Typical methods include single Gaussian
distribution [27], Mixture of Gaussian [28], kernel density estimation [29], [30], block correlation
[31], codebook model [32], Hidden Markov model [33], [34] and linear autoregressive models
[35], [36].

Learning with sparsity has drawn a lot of attention in recent machine learning and computer
vision research [37], and several methods based on the sparse representation for background
modeling have been developed. One pioneering work is the eigen backgrounds model [38], where
principal component analysis (PCA) is performed on a training sequence. When a new frame
arrives, it is projected onto the subspace spanned by the principal components, and the residues
indicate the presence of new objects. An alternative approach that can operate sequentially is
sparse signal recovery [39], [40], [41]. Background subtraction is formulated as a regression
problem with the assumption that a newly arriving frame should be sparsely represented by a
linear combination of preceding frames except for foreground parts. These models capture the
correlation between video frames. Thus, they can naturally handle the global variations in the
background such as illumination change and dynamic textures.

The background subtraction methods mentioned above rarely consider the scenario where the
objects appear at the start and are continuously present in the scene (i.e. the training sequence is
not available). Few works consider the problem of background initialization [11], [42]. Most
of them seek a stable interval, inside which the intensity is relatively smooth for each pixel
independently. Pixels during such intervals are regarded as background, and the background
scene is estimated from these intervals. The validity of this approach relies on the assumption of
a static background. Thus, it is limited when processing dynamic backgrounds or videos captured
by a moving camera.
III. Contiguous Outlier Detection in the Low-Rank Representation

In this section, we focus on the problem of detecting contiguous outliers in the low-rank
representation. We first consider the case without camera motion. We will discuss the scenarios
with moving cameras in Section IV.

A. Notations 

In this paper, we use the following notations. $I_j \in \mathbb{R}^m$ denotes the $j$-th frame of a video sequence,
which is written as a column vector consisting of $m$ pixels. The $i$-th pixel in the $j$-th frame is
denoted as $ij$. $D = [I_1, \ldots, I_n] \in \mathbb{R}^{m \times n}$ is a matrix representing all $n$ frames of a sequence.
$B \in \mathbb{R}^{m \times n}$ is a matrix of the same size as $D$, which denotes the underlying background
images. $S \in \{0, 1\}^{m \times n}$ is a binary matrix denoting the foreground support:

$$S_{ij} = \begin{cases} 0, & \text{if } ij \text{ is background} \\ 1, & \text{if } ij \text{ is foreground.} \end{cases} \tag{1}$$

We use $\mathcal{P}_S(X)$ to represent the orthogonal projection of a matrix $X$ onto the linear space of
matrices supported by $S$:

$$\mathcal{P}_S(X)(i, j) = \begin{cases} 0, & \text{if } S_{ij} = 0 \\ X_{ij}, & \text{if } S_{ij} = 1, \end{cases} \tag{2}$$

and $\mathcal{P}_{S^\perp}(X)$ to be its complementary projection, i.e. $\mathcal{P}_S(X) + \mathcal{P}_{S^\perp}(X) = X$.

Four norms of a matrix are used throughout this paper. $\|X\|_0$ denotes the $\ell_0$-norm, which counts
the number of nonzero entries. $\|X\|_1 = \sum_{ij} |X_{ij}|$ denotes the $\ell_1$-norm. $\|X\|_F = \sqrt{\sum_{ij} X_{ij}^2}$ is
the Frobenius norm. $\|X\|_*$ means the nuclear norm, i.e. the sum of singular values.

B. Formulation

Given a sequence $D$, our objective is to estimate the foreground support $S$ as well as the
underlying background images $B$. To make the problem well-posed, we use the following models
to describe the foreground, the background and the formation of the observed signal:

Background model: The background intensity should be unchanged over the sequence except
for variations arising from illumination change or periodical motion of dynamic textures¹. Thus,
background images are linearly correlated with each other, forming a low-rank matrix $B$. Besides
the low-rank property, we do not make any additional assumption on the background scene. Thus,
we only impose the following constraint on $B$:

$$\operatorname{rank}(B) \le K, \tag{3}$$

where $K$ is a constant to be predefined. Intrinsically, $K$ constrains the complexity of the
background model. We will discuss more on this parameter in Section V-A2.

¹ Background motion caused by moving cameras will be considered in Section IV.

Foreground model: The foreground is defined as any object that moves differently from the
background. Foreground motion gives intensity changes that cannot be fitted into the low-rank
model of the background. Thus, they can be detected as outliers in the low-rank representation.
Generally, we have a prior that foreground objects should be contiguous pieces with relatively
small size. The binary states of entries in the foreground support $S$ can be naturally modeled by a
Markov Random Field [43], [14]. Consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of vertices
denoting all $m \times n$ pixels in the sequence and $\mathcal{E}$ is the set of edges connecting spatially or
temporally neighboring pixels. Then, the energy of $S$ is given by the Ising model [14]:

$$\sum_{ij \in \mathcal{V}} u_{ij}(S_{ij}) + \sum_{(ij,kl) \in \mathcal{E}} \lambda_{ij,kl} |S_{ij} - S_{kl}|, \tag{4}$$

where $u_{ij}$ denotes the unary potential of $S_{ij}$ being 0 or 1, and the parameter $\lambda_{ij,kl} > 0$ controls the
strength of dependency between $S_{ij}$ and $S_{kl}$. To prefer $S_{ij} = 0$, which indicates sparse foreground,
we define the unary potential $u_{ij}$ as:

$$u_{ij}(S_{ij}) = \begin{cases} 0, & \text{if } S_{ij} = 0 \\ \lambda_{ij}, & \text{if } S_{ij} = 1, \end{cases} \tag{5}$$

where the parameter $\lambda_{ij} > 0$ penalizes $S_{ij} = 1$. For simplicity, we set $\lambda_{ij}$ and $\lambda_{ij,kl}$ as constants
over all locations. That is, $\lambda_{ij} = \beta$ and $\lambda_{ij,kl} = \gamma$, where $\beta > 0$ and $\gamma > 0$ are positive constants.
This means that we have no additional prior about the locations of objects.

Signal model: The signal model describes the formation of $D$, given $B$ and $S$. In the
background region where $S_{ij} = 0$, we assume that $D_{ij} = B_{ij} + \epsilon_{ij}$, where $\epsilon_{ij}$ denotes i.i.d.
Gaussian noise. That is, $D_{ij} \sim \mathcal{N}(B_{ij}, \sigma^2)$ with $\sigma^2$ being the variance of the Gaussian noise. Thus,
$B_{ij}$ should be the best fit to $D_{ij}$ in the least-squares sense when $S_{ij} = 0$. In the foreground
regions where $S_{ij} = 1$, the background scene is occluded by the foreground. Thus, $D_{ij}$ equals
the foreground intensity. Since we do not make any assumption about the foreground appearance,
$D_{ij}$ is not constrained when $S_{ij} = 1$.

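Combining the above three models, we propose to minimize the following energy to estimate $B$
and $S$:

$$\min_{B,\, S_{ij} \in \{0,1\}} \; \frac{1}{2} \sum_{ij:\, S_{ij} = 0} (D_{ij} - B_{ij})^2 + \beta \sum_{ij} S_{ij} + \gamma \sum_{(ij,kl) \in \mathcal{E}} |S_{ij} - S_{kl}|, \quad \text{s.t. } \operatorname{rank}(B) \le K. \tag{6}$$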

This formulation says that the background images should form a low-rank matrix and fit the 
observed sequence in the least-squares sense except for foreground regions that are sparse and 
contiguous. 

To make the energy minimization tractable, we relax the rank operator on $B$ with the nuclear
norm. The nuclear norm has proven to be an effective convex surrogate of the rank operator [44].
Moreover, it can help to avoid overfitting, which will be illustrated by experiments in Section
V-A2.

Writing (6) in its dual form and introducing matrix operators, we obtain the final form of the
energy function:

$$\min_{B,\, S_{ij} \in \{0,1\}} \; \frac{1}{2} \|\mathcal{P}_{S^\perp}(D - B)\|_F^2 + \alpha \|B\|_* + \beta \|S\|_1 + \gamma \|A \operatorname{vec}(S)\|_1. \tag{7}$$

Here, $A$ is the node-edge incidence matrix of $\mathcal{G}$, and $\alpha > 0$ is a parameter associated with $K$,
which controls the complexity of the background model. The proper choice of $\alpha$, $\beta$ and $\gamma$ will be
discussed in detail in Section III-C3.
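
To make the four terms of the energy in (7) concrete, the following is a minimal sketch that evaluates them with NumPy. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the function name `decolor_energy` is hypothetical, and each column of `D` is assumed to be a 1D image whose only graph edges connect vertically adjacent pixels, so that the term with $A \operatorname{vec}(S)$ reduces to counting label changes down each column.

```python
import numpy as np

def decolor_energy(D, B, S, alpha, beta, gamma):
    """Evaluate the energy in (7) for 1D images stored as columns (assumption)."""
    # Data term: squared fitting error on background entries only (S == 0).
    fit = 0.5 * np.sum(((D - B) ** 2)[S == 0])
    # Nuclear norm of B: sum of its singular values.
    nuclear = np.linalg.svd(B, compute_uv=False).sum()
    # l1 norm of the binary support: number of foreground entries.
    sparsity = S.sum()
    # Smoothness: number of label changes between vertical neighbours.
    smooth = np.abs(np.diff(S, axis=0)).sum()
    return fit + alpha * nuclear + beta * sparsity + gamma * smooth
```

For 2D frames the smoothness term would also have to count horizontal neighbours after reshaping each column back to the image grid.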






C. Algorithm 

The objective function defined in (7) is non-convex and it includes both continuous and discrete
variables. Joint optimization over $B$ and $S$ is extremely difficult. Hence, we adopt an alternating
algorithm that separates the energy minimization over $B$ and $S$ into two steps. The $B$-step is a convex
optimization problem and the $S$-step is a combinatorial optimization problem. It turns out that the
optimal solutions of the $B$-step and the $S$-step can be computed efficiently.
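1) Estimation of the low-rank matrix B: Given an estimate of the support $\hat{S}$, the minimization
in (7) over $B$ turns out to be the matrix completion problem [45]:

$$\min_B \; \frac{1}{2} \|\mathcal{P}_{\hat{S}^\perp}(D - B)\|_F^2 + \alpha \|B\|_*. \tag{8}$$

This is to learn a low-rank matrix from partial observations. The optimal $B$ in (8) can be
computed efficiently by the SOFT-IMPUTE algorithm [45], which makes use of the following
lemma [46]:

Lemma 1: Given a matrix $Z$, the solution to the optimization problem

$$\min_X \; \frac{1}{2} \|Z - X\|_F^2 + \alpha \|X\|_* \tag{9}$$

is given by $\hat{X} = \Theta_\alpha(Z)$, where $\Theta_\alpha$ means the singular value thresholding:

$$\Theta_\alpha(Z) = U \Sigma_\alpha V^T. \tag{10}$$

Here, $\Sigma_\alpha = \operatorname{diag}[(d_1 - \alpha)_+, \ldots, (d_r - \alpha)_+]$, $U \Sigma V^T$ is the SVD of $Z$, $\Sigma = \operatorname{diag}[d_1, \ldots, d_r]$ and
$t_+ = \max(t, 0)$.

Rewriting (8), we have

$$\min_B \; \frac{1}{2} \|\mathcal{P}_{\hat{S}^\perp}(D - B)\|_F^2 + \alpha \|B\|_* = \min_B \; \frac{1}{2} \|[\mathcal{P}_{\hat{S}^\perp}(D) + \mathcal{P}_{\hat{S}}(B)] - B\|_F^2 + \alpha \|B\|_*. \tag{11}$$

Using Lemma 1, the optimal solution to (8) can be obtained by iteratively using:

$$\hat{B} \leftarrow \Theta_\alpha(\mathcal{P}_{\hat{S}^\perp}(D) + \mathcal{P}_{\hat{S}}(\hat{B})) \tag{12}$$

with arbitrarily initialized $\hat{B}$. Please refer to [45] for the details of SOFT-IMPUTE and the proof
of its convergence.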
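
The iteration in (12) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released MATLAB code: the function names `svt` and `soft_impute`, the zero initialization, the stopping rule and the defaults `tol` and `max_iter` are all assumptions made for the example.

```python
import numpy as np

def svt(Z, alpha):
    """Singular value thresholding: shrink every singular value of Z by alpha."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(d - alpha, 0.0)) @ Vt

def soft_impute(D, S, alpha, B0=None, tol=1e-6, max_iter=500):
    """Iterate B <- svt(P_{S-perp}(D) + P_S(B), alpha), as in (12).

    S is a 0/1 array; entries with S == 1 are treated as unobserved.
    """
    mask = (S == 1)
    B = np.zeros_like(D, dtype=float) if B0 is None else B0.astype(float).copy()
    for _ in range(max_iter):
        # Keep observed data where S == 0, fill the rest with the current estimate.
        Z = np.where(mask, B, D)
        B_new = svt(Z, alpha)
        # Stop when the relative change is small (assumed stopping rule).
        if np.linalg.norm(B_new - B) <= tol * max(1.0, np.linalg.norm(B)):
            return B_new
        B = B_new
    return B
```

Passing the previous solution as `B0` when `alpha` is reduced gives the warm start mentioned in the parameter-tuning discussion.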

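2) Estimation of the outlier support S: Next, we investigate how to minimize the energy in
(7) over $S$ given the low-rank matrix $\hat{B}$. Noticing that $S_{ij} \in \{0, 1\}$, the energy can be rewritten
as follows:

$$\begin{aligned} & \frac{1}{2} \|\mathcal{P}_{S^\perp}(D - \hat{B})\|_F^2 + \beta \|S\|_1 + \gamma \|A \operatorname{vec}(S)\|_1 \\ = {} & \frac{1}{2} \sum_{ij} (D_{ij} - \hat{B}_{ij})^2 (1 - S_{ij}) + \beta \sum_{ij} S_{ij} + \gamma \|A \operatorname{vec}(S)\|_1 \\ = {} & \sum_{ij} \Big( \beta - \frac{1}{2} (D_{ij} - \hat{B}_{ij})^2 \Big) S_{ij} + \gamma \|A \operatorname{vec}(S)\|_1 + C, \end{aligned} \tag{13}$$

where $C = \frac{1}{2} \sum_{ij} (D_{ij} - \hat{B}_{ij})^2$ is a constant when $\hat{B}$ is fixed. The above energy is in the standard
form of first-order MRFs with binary labels, which can be solved exactly using graph cuts
[47], [48].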

Ideally, both spatial and temporal smoothness can be imposed by connecting all pairs of nodes
in $\mathcal{G}$ which correspond to all pairs of spatially or temporally neighboring pixels in the sequence.
However, this will make $\mathcal{G}$ extremely large and difficult to solve. In implementation, we only
connect spatial neighbors. Thus, $\mathcal{G}$ can be separated into subgraphs of single images, and the
graph cuts can be operated for each image separately. This dramatically reduces the computational
cost. Based on our observation, the spatial smoothness is sufficient to obtain satisfactory results.
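
For intuition about the per-image subproblem, consider the special case of a 1D image with only spatial neighbors connected. The graph is then a chain, and the binary labeling can be minimized exactly by dynamic programming. The sketch below uses this swapped-in technique instead of graph cuts, which would be needed for 2D frames. The function name `segment_chain` and its argument layout are hypothetical; `half_sq_residual[i]` stands for half the squared difference between the observed and estimated background intensity at pixel $i$.

```python
def segment_chain(half_sq_residual, beta, gamma):
    """Exactly minimize sum_i (beta - r_i) * s_i + gamma * sum_i |s_i - s_{i+1}|
    over binary labels s on a chain, with r_i = half_sq_residual[i]."""
    n = len(half_sq_residual)
    if n == 0:
        return []
    # Unary cost of label 0 is 0; unary cost of label 1 is beta - r_i.
    unary = [(0.0, beta - r) for r in half_sq_residual]
    cost = [unary[0][0], unary[0][1]]
    back = []  # back[i-1][s] = best label at pixel i-1 given label s at pixel i
    for i in range(1, n):
        new_cost, choice = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay, switch = cost[s], cost[1 - s] + gamma
            if stay <= switch:
                new_cost[s], choice[s] = stay + unary[i][s], s
            else:
                new_cost[s], choice[s] = switch + unary[i][s], 1 - s
        cost = new_cost
        back.append(choice)
    # Backtrack from the cheaper final label.
    s = 0 if cost[0] <= cost[1] else 1
    labels = [s]
    for choice in reversed(back):
        s = choice[s]
        labels.append(s)
    labels.reverse()
    return labels
```

With `gamma = 0` the result reduces to pixelwise thresholding of the residual against `beta`; a positive `gamma` fills small gaps and removes isolated detections, which is the effect of the contiguity prior.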

3) Parameter tuning: The parameter $\alpha$ in (7) controls the complexity of the background
model. A larger $\alpha$ gives a $\hat{B}$ with smaller nuclear norm. In our algorithm, we first give a rough
estimate of the rank of the background model, i.e. $K$ in (6). Then, we start from a large $\alpha$.
After each run of SOFT-IMPUTE, if $\operatorname{rank}(\hat{B}) \le K$, we reduce $\alpha$ by a factor $\eta_1 < 1$ and repeat
SOFT-IMPUTE until $\operatorname{rank}(\hat{B}) > K$. Using warm-start, this sequential optimization is efficient
[45]. In our implementation, we initialize $\alpha$ to be the second largest singular value of $D$, and
$\eta_1 = 1/\sqrt{2}$.

The parameter $\beta$ in (7) controls the sparsity of the outlier support. From (13) we can see that
$\hat{S}_{ij}$ is more likely to be 1 if $\frac{1}{2}(D_{ij} - \hat{B}_{ij})^2 > \beta$. Thus the choice of $\beta$ should depend on the
noise level in images. Typically we set $\beta = 4.5\hat{\sigma}^2$, where $\hat{\sigma}^2$ is estimated online by the variance
of $D_{ij} - \hat{B}_{ij}$. Since the estimation of $\hat{B}$ and $\hat{\sigma}$ is biased at the beginning iterations, we propose
to start our algorithm with a relatively large $\beta$, and then reduce $\beta$ by a factor $\eta_2 = 0.5$ after
each iteration until $\beta$ reaches $4.5\hat{\sigma}^2$. In other words, we tolerate more error in model fitting at
the beginning, since the model itself is not accurate enough. As the model estimation
improves, we decrease the threshold and declare more and more outliers.
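
The decreasing schedule for $\beta$ can be sketched as a small generator. This is an illustration only: the name `beta_schedule` is hypothetical, and the noise variance is held fixed here for simplicity, whereas the text re-estimates it online at every iteration.

```python
def beta_schedule(beta_init, sigma2_hat, eta2=0.5):
    """Yield beta per outer iteration: shrink by eta2, floored at 4.5 * sigma2_hat.

    sigma2_hat is kept constant in this sketch (assumption); the algorithm
    described in the text updates the estimate after each iteration.
    """
    floor = 4.5 * sigma2_hat
    beta = beta_init
    while True:
        yield beta
        if beta <= floor:
            return
        beta = max(eta2 * beta, floor)
```

For example, `list(beta_schedule(40.0, 1.0))` gives `[40.0, 20.0, 10.0, 5.0, 4.5]`: the threshold halves until it reaches the noise-dependent floor.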

In conclusion, we only have two parameters to choose, i.e. $K$ and $\gamma$. In Section V-A2 we will
show that DECOLOR performs stably if $K$ and $\gamma$ are in proper ranges. In all our experiments,
we let $K = \sqrt{n}$, and $\gamma = \beta$ and $5\beta$ for simulation and real sequences, respectively.

4) Convergence: For fixed parameters, we always minimize a single lower-bounded energy
in each step. The convergence property of SOFT-IMPUTE has been proved in [45]. Therefore,
the algorithm must converge to a local minimum. For adaptive parameter tuning, our strategy
guarantees that the coefficients $(\alpha, \beta, \gamma)$ keep decreasing for each change. Thus, the energy in (7)
decreases monotonically with the algorithm running. Furthermore, we can manually set lower
bounds for both a and /3 to stop the iteration. Empirically, DECOLOR converges in about 20 
iterations for a convergence precision of 10"^. 

D. Relation to Other Methods 

1) Robust Principal Component Analysis: RPCA has drawn a lot of attention in computer
vision [49], [50]. Recently, the seminal work [13] showed that, under some mild conditions, the
low-rank model can be recovered from unknown corruption patterns via a convex program named
Principal Component Pursuit (PCP). The examples in [13] demonstrate the superior performance
of PCP compared with previous methods of RPCA and its promising potential for background
subtraction.

As discussed in [13], PCP can be regarded as a special case of the following decomposition
model:

$$D = B + E + \epsilon, \tag{14}$$

where $B$ is a low-rank matrix, $E$ represents the intensity shift caused by outliers and $\epsilon$ denotes
the Gaussian noise. PCP only seeks the low-rank and sparse decomposition $D = B + E$
without considering $\epsilon$. Recently, Stable Principal Component Pursuit (SPCP) has been proposed
[51]. It extends PCP [13] to handle both sparse gross errors and small entrywise noise. It tries
to find the decomposition by minimizing the following energy:

$$\min_{B, E} \; \frac{1}{2} \|D - B - E\|_F^2 + \alpha \operatorname{rank}(B) + \beta \|E\|_0. \tag{15}$$

To make the optimization tractable, (15) is relaxed by replacing $\operatorname{rank}(B)$ with $\|B\|_*$ and $\|E\|_0$
with $\|E\|_1$ in PCP or SPCP. Thus, the problem turns out to be convex and can be solved
efficiently via convex optimization. However, the $\ell_1$ relaxation requires that the distribution of
corruption should be sparse and random enough, which is not generally true in the problem of
motion segmentation. Experiments in Section V show that PCP is not robust enough when the
moving objects take up relatively large and contiguous space of the sequence.

Next, we shall explain the relation between our formulation in (7) and the formulation in (15).
It is easy to see that, as long as $E_{ij} \ne 0$, we must have $E_{ij} = D_{ij} - B_{ij}$ to minimize (15). Thus,
(15) has the same minimizer as the following energy:

$$\min_{B, E} \; \frac{1}{2} \sum_{ij:\, E_{ij} = 0} (D_{ij} - B_{ij})^2 + \alpha \operatorname{rank}(B) + \beta \|E\|_0. \tag{16}$$

The first term in (16) can be rewritten as $\frac{1}{2} \|\mathcal{P}_{S^\perp}(D - B)\|_F^2$. Noticing that $\|E\|_0 = \|S\|_1$ and
replacing $\operatorname{rank}(B)$ with $\|B\|_*$, (16) can be finally rewritten as (7) if the last smoothness term in
(7) is ignored.

Thus, DECOLOR can be regarded as a special form of RPCA, where the $\ell_0$-penalty on $E$ is
not relaxed and the problem in (15) is converted to the optimization over $S$ in (7). One recent
work [52] has shown that the $\ell_0$-penalty works effectively for outlier detection in regression,
while the $\ell_1$-penalty does not. As pointed out in [52], the theoretical reason for the unsatisfactory
performance of the $\ell_1$-penalty is that the irrepresentable condition [53] is often not satisfied in the
outlier detection problem. In order to go beyond the $\ell_1$-penalty, non-convex penalties have been
explored in recent literature [52], [54]. Compared with the $\ell_1$-norm, non-convex penalties give
an estimation with less bias but higher variance. Thus, these non-convex penalties are superior
to the $\ell_1$-penalty when the signal-to-noise ratio (SNR) is relatively high [54]. This is the case
for natural video analysis.

In summary, both PCP [13] and DECOLOR aim to recover a low-rank model from corrupted
data. PCP [13], [51] uses the convex relaxation by replacing $\operatorname{rank}(B)$ with $\|B\|_*$ and $\|E\|_0$
with $\|E\|_1$. DECOLOR only relaxes the rank penalty and keeps the $\ell_0$-penalty on $E$ to preserve
the robustness to outliers. Moreover, DECOLOR estimates the outlier support $S$ explicitly by
formulating the problem as the energy minimization over $S$, and models the contiguity prior on
$S$ using MRFs to improve the accuracy of detecting contiguous outliers.

2) Sparse signal recovery: With the success of compressive sensing [55], sparse signal
recovery has become a popular framework to deal with various problems in machine learning and
signal processing [37], [56], [57]. To make use of structural information about nonzero patterns of
variables, structured sparsity has been defined in recent works [58], [59], and several algorithms have
been developed and applied successfully to background subtraction, such as Lattice Matching
Pursuit (LaMP) [39], Dynamic Group Sparsity (DGS) recovery [40] and Proximal Operator using
Network Flow (ProxFlow) [41].

In sparse signal recovery for background subtraction, a testing image $y \in \mathbb{R}^m$ is modeled as
a sparse linear combination of $n$ previous frames $\Phi \in \mathbb{R}^{m \times n}$ plus a sparse error term $e \in \mathbb{R}^m$
and a Gaussian noise term $\epsilon \in \mathbb{R}^m$:

$$y = \Phi w + e + \epsilon. \tag{17}$$

$w \in \mathbb{R}^n$ is the coefficient vector. The first term $\Phi w$ accounts for the background shared between $y$
and $\Phi$, while the sparse error $e$ corresponds to the foreground in $y$. Thus, background subtraction
can be achieved by recovering $w$ and $e$. Taking the latest algorithm ProxFlow [41] as an example,
the following optimization is proposed:

$$\min_{w, e} \; \frac{1}{2} \|y - \Phi w - e\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|e\|_{\ell_1/\ell_\infty}, \tag{18}$$

where $\|\cdot\|_{\ell_1/\ell_\infty}$ is a norm to induce the group-sparsity. Please refer to [41] for the detailed
definition. In short, the $\ell_1/\ell_\infty$-norm is used as a structured regularizer to encode the prior that
nonzero entries of $e$ should be in a group structure, where the groups are specified to be all
overlapping $3 \times 3$ squares on the image plane [41].

In (17), $\Phi$ can be interpreted as a basis matrix for linear regression to fit the testing image $y$.
In the literature mentioned above, $\Phi$ is fixed to be the training sequence [41] or previous frames
on which background subtraction has been performed [40]. Then, the only task is to recover the
sparse coefficients.

In our problem formulation, $\Phi$ is unknown. DECOLOR learns the bases and coefficients for
a batch of test images simultaneously. To illustrate this, we can rewrite (14) as:

$$D = \Phi W + E + \epsilon, \tag{19}$$

where the original low-rank $B$ is factorized as a product of a basis matrix $\Phi \in \mathbb{R}^{m \times r}$ and a
coefficient matrix $W \in \mathbb{R}^{r \times n}$ with $r$ being the rank of $B$.

In summary, LaMP, DGS and ProxFlow aim to detect new objects in a new testing image
given a training sequence not containing such objects. The problem is formulated as linear
regression with fixed bases. DECOLOR aims to segment moving objects from a short sequence
during which the objects continuously appear, which is a more challenging problem. To this
end, DECOLOR estimates the foreground and background jointly by outlier detection during
matrix learning. The difference between DECOLOR and sparse signal recovery will be further
demonstrated using experiments on real sequences in Section V-B1.
IV. Extension to Moving Background

The above derivation is based on the assumption that the videos are captured by static cameras.
In this section, we introduce domain transformations into our model to compensate for the
background motion caused by moving cameras. Here we use the 2D parametric transforms [60]
to model the translation, rotation and planar deformation of the background.

Let $D_j \circ \tau_j$ denote the $j$-th frame after the transformation parameterized by vector $\tau_j \in \mathbb{R}^p$,
where $p$ is the number of parameters of the motion model (e.g. $p = 6$ for the affine motion or
$p = 8$ for the projective motion). Then the proposed decomposition becomes $D \circ \tau = B + E + \epsilon$,
where $D \circ \tau = [D_1 \circ \tau_1, \ldots, D_n \circ \tau_n]$ and $\tau \in \mathbb{R}^{pn}$ is a vector comprising all $\tau_j$. A similar idea
can be found in the recent work on batch image alignment [57].

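Next, we substitute $D$ in (7) with $D \circ \tau$ and estimate $\tau$ along with $B$, $S$ by iteratively
minimizing:

$$\min_{\tau, B, S} \; \frac{1}{2} \|\mathcal{P}_{S^\perp}(D \circ \tau - B)\|_F^2 + \alpha \|B\|_* + \beta \|S\|_1 + \gamma \|A \operatorname{vec}(S)\|_1. \tag{20}$$

Now we investigate how to minimize the energy in (20) over $\tau$, given $\hat{B}$ and $\hat{S}$:

$$\hat{\tau} = \arg\min_{\tau} \|\mathcal{P}_{\hat{S}^\perp}(D \circ \tau - \hat{B})\|_F^2. \tag{21}$$

Here we use the incremental refinement [57], [60] to solve this parametric motion estimation
problem: at each iteration, we update $\hat{\tau}$ by a small increment $\Delta\tau$ and linearize $D \circ \tau$ as
$D \circ \hat{\tau} + J_{\hat{\tau}} \Delta\tau$, where $J_{\hat{\tau}}$ denotes the Jacobian matrix of $D \circ \tau$ with respect to $\tau$,
evaluated at $\hat{\tau}$. Thus, $\tau$ can be updated in the following way:

$$\hat{\tau} \leftarrow \hat{\tau} + \arg\min_{\Delta\tau} \|\mathcal{P}_{\hat{S}^\perp}(D \circ \hat{\tau} - \hat{B} + J_{\hat{\tau}} \Delta\tau)\|_F^2. \tag{22}$$

The minimization over $\Delta\tau$ in (22) is a weighted least-squares problem, which has a
closed-form solution.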
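
As a worked form of this closed-form solution, the following is a sketch under the assumption that the update is solved one frame at a time. The symbols $J_j$ (the $m \times p$ Jacobian of frame $j$), $W_j$ (a diagonal 0/1 weight matrix that keeps only background pixels) and $r_j$ (the current residual of frame $j$) are introduced here for illustration and do not appear in the original text:

$$\Delta\tau_j = -\left(J_j^T W_j J_j\right)^{-1} J_j^T W_j \, r_j, \qquad r_j = D_j \circ \hat{\tau}_j - \hat{B}_j, \qquad W_j = \operatorname{diag}(1 - \hat{S}_{1j}, \ldots, 1 - \hat{S}_{mj}).$$

This is the standard normal-equation solution of weighted least squares. It assumes that $J_j^T W_j J_j$ is invertible, i.e. that enough background pixels with informative image gradients remain in the frame.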

In practice, the update of $\tau_1, \ldots, \tau_n$ can be done separately since the transformation is applied
on each image individually. Thus the update of $\tau$ is efficient. To accelerate the convergence of
DECOLOR, we initialize $\tau$ by roughly aligning each frame $D_j$ to the middle frame $D_{n/2}$ before
the main loops of DECOLOR. The pre-alignment is done by the robust multiresolution method
proposed in [61].

All steps of DECOLOR with adaptive parameter tuning are summarized in Algorithm 1.
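Algorithm 1 Moving Object Segmentation by DECOLOR

1.  Input: $D = [I_1, \ldots, I_n] \in \mathbb{R}^{m \times n}$
2.  Initialize: $\hat{\tau}$, $\hat{B} \leftarrow D \circ \hat{\tau}$, $\hat{S} \leftarrow 0$, $\alpha$, $\beta$.
3.  repeat
4.      $\hat{\tau} \leftarrow \hat{\tau} + \arg\min_{\Delta\tau} \|\mathcal{P}_{\hat{S}^\perp}(D \circ \hat{\tau} - \hat{B} + J_{\hat{\tau}} \Delta\tau)\|_F^2$;
5.      repeat
6.          $\hat{B} \leftarrow \Theta_\alpha(\mathcal{P}_{\hat{S}^\perp}(D \circ \hat{\tau}) + \mathcal{P}_{\hat{S}}(\hat{B}))$;
7.      until convergence
8.      if $\operatorname{rank}(\hat{B}) \le K$ then
9.          $\alpha \leftarrow \eta_1 \alpha$;
10.         go to Step 5;
11.     end if
12.     estimate $\hat{\sigma}$;
13.     $\beta \leftarrow \max(\eta_2 \beta, 4.5\hat{\sigma}^2)$;
14.     $\hat{S} \leftarrow \arg\min_S \sum_{ij} \big(\beta - \frac{1}{2}([D \circ \hat{\tau}]_{ij} - \hat{B}_{ij})^2\big) S_{ij} + \gamma \|A \operatorname{vec}(S)\|_1$;
15. until convergence
16. Output: $\hat{B}$, $\hat{S}$, $\hat{\tau}$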



V. Experiments 

A. Simulation 

In this section, we perform numerical experiments on synthesized data. We consider the
situations with no background motion and mainly investigate whether DECOLOR can successfully
separate the contiguous outliers from the low-rank model.

Fig. 2. Panels: (a) Data, (b) Truth, (c) PCP, (d) DECOLOR. (a) An example of synthesized data. Sequence
$D \in \mathbb{R}^{100 \times 50}$ is a matrix composed of 50 frames of 1D images with 100 pixels per image. (b) The foreground
support $S_0$ and underlying background images $B_0$. $\operatorname{rank}(B_0) = 3$. $D$ is generated by adding a foreground object
with width $W = 40$ to each column of $B_0$, which moves downwards for 1 pixel per column. Also, i.i.d. Gaussian
noise is added to each entry, and SNR = 10. (c) The results of PCP. The top panel is $\hat{S}$ and the bottom panel is
$\hat{B}$. $\hat{S}$ of PCP is obtained by thresholding $|D_{ij} - \hat{B}_{ij}|$ with a threshold that gives the largest F-measure.
Notice the artifacts in both $\hat{S}$ and $\hat{B}$ estimated by PCP. (d) The results of DECOLOR. Here $\hat{S}$ is directly
output by DECOLOR without postprocessing.

To better visualize the data, we use a simplified scenario: the video to be segmented is
composed of 1D images. Thus, the image sequence and results can be displayed as 2D matrices.
We generate the input $D$ by adding a foreground occlusion with support $S_0$ to a background
matrix $B_0$. The background matrix $B_0$ with rank $r$ is generated as $B_0 = UV^T$, where $U$ and
$V$ are $m \times r$ and $n \times r$ matrices with entries independently sampled from a standard normal
distribution. We choose $m = 100$, $n = 50$ and $r = 3$ for all experiments. Then, an object with
width $W$ is superposed on each column of $B_0$ and shifts downwards for 1 pixel per column. The
intensity of this object is independently sampled from a uniform distribution $\mathcal{U}(-c, c)$, where $c$
is chosen to be the largest magnitude of entries in $B_0$. Also, we add i.i.d. Gaussian noise $\epsilon$ to
$D$ with the corresponding signal-to-noise ratio (SNR) defined as:

$$\mathrm{SNR} = \sqrt{\frac{\operatorname{var}(B_0)}{\operatorname{var}(\epsilon)}}. \tag{23}$$

Fig. 2(a) shows an example, where the moving foreground can be recognized as contiguous outliers superposed on a low-rank matrix. Our goal is to estimate $S_0$ and recover $B_0$ at the same time.
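The data generation described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' script: the foreground is assumed to replace (occlude) the background on its support, the object is assumed to start at row $j$ in column $j$, each foreground entry is drawn independently, and the noise level is derived by reading (23) as the ratio of the standard deviation of $B_0$ to that of the noise. The names `synthesize` and `variance` are hypothetical.

```python
import math
import random


def variance(values):
    """Population variance of a flat list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def synthesize(m=100, n=50, r=3, width=40, snr=10.0, seed=0):
    """Return (D, B0, S0) as m x n nested lists."""
    rng = random.Random(seed)
    # Low-rank background B0 = U V^T with standard normal factors.
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(m)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    B0 = [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(n)]
          for i in range(m)]

    c = max(abs(x) for row in B0 for x in row)           # foreground intensity range
    noise_std = math.sqrt(variance([x for row in B0 for x in row])) / snr

    D = [row[:] for row in B0]
    S0 = [[0] * n for _ in range(m)]
    for j in range(n):
        # Assumption: the object occupies rows j .. j + width - 1 in column j.
        for i in range(j, min(j + width, m)):
            S0[i][j] = 1
            D[i][j] = rng.uniform(-c, c)                  # assumption: occlusion
    for i in range(m):
        for j in range(n):
            D[i][j] += rng.gauss(0.0, noise_std)          # i.i.d. Gaussian noise
    return D, B0, S0


D, B0, S0 = synthesize()
```

Each column of `S0` then contains one contiguous run of `width` ones that is shifted by one row relative to the previous column, which is the diagonal band of contiguous outliers described in the text.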

For quantitative evaluation, we measure the accuracy of outlier detection by comparing $\hat{S}$ with $S_0$. We regard it as a classification problem and evaluate the results using precision and recall, which are defined as:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad (24)$$

where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. Precision and recall are widely used when the class distribution is skewed [62]. For simplicity, instead of plotting precision/recall curves, we use a single measurement named F-measure that combines precision and recall:

$$\text{F-measure} = 2\,\frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \qquad (25)$$

Fig. 3. Quantitative evaluation. (a) F-measure and RMSE as functions of $W$, when SNR = 10. (b) F-measure and RMSE as functions of SNR, when $W = 25$. (c) The effects of parameters, i.e. $K$ and $\gamma$. The results are averaged over 50 random trials with $W = 25$ and SNR = 10. The top panel shows the effect of $K$. The true rank of $B_0$ is 3. The accuracy increases sharply when $K$ changes from 1 to 3 and decreases smoothly after $K$ is larger than 3. The bottom panel shows the effect of $\gamma$. The accuracy keeps stable within $[\beta, 10\beta]$.
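As a small worked example of (24) and (25), the following Python sketch counts true positives, false positives and false negatives between an estimated binary mask and the true support, and combines them into the F-measure. The function name and the convention of returning 0 when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f(estimated, truth):
    """Precision, recall and F-measure of a binary mask against the true support."""
    tp = fp = fn = 0
    for est_row, true_row in zip(estimated, truth):
        for e, t in zip(est_row, true_row):
            if e and t:
                tp += 1          # detected and truly foreground
            elif e and not t:
                fp += 1          # detected but actually background
            elif t and not e:
                fn += 1          # foreground that was missed
    # Assumed convention: a zero denominator yields a score of 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


estimated = [[1, 1, 0, 0],
             [0, 1, 0, 0]]
truth = [[1, 0, 0, 0],
         [0, 1, 1, 0]]
precision, recall, f_measure = precision_recall_f(estimated, truth)
```

Here TP = 2, FP = 1 and FN = 1, so precision, recall and F-measure all equal 2/3. True negatives do not enter any of the three scores, which is why they remain informative when background pixels dominate.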

The higher the F-measure is, the better the detection accuracy is. In our observation, PCP requires proper thresholding to generate a really sparse $\hat{S}$. For a fair comparison, $\hat{S}$ of PCP is obtained by thresholding $|D_{ij} - \hat{B}_{ij}|$ with the threshold that gives the maximal F-measure. Furthermore, we measure the accuracy of low-rank recovery by calculating the difference between $\hat{B}$ and $B_0$. We use the Root Mean Square Error (RMSE) to measure the difference:

$$\mathrm{RMSE} = \frac{\|\hat{B} - B_0\|_F}{\|B_0\|_F}. \qquad (26)$$
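The two evaluation devices above, the relative error in (26) and the best-case thresholding used for PCP, can be sketched as follows. This is an illustrative Python sketch rather than the evaluation code used in the experiments: the function names are hypothetical, and the threshold search simply tries every observed residual value, which is one possible way to find the threshold with maximal F-measure.

```python
import math


def relative_rmse(B_hat, B0):
    """||B_hat - B0||_F / ||B0||_F for nested-list matrices, as in (26)."""
    num = sum((a - b) ** 2 for ra, rb in zip(B_hat, B0) for a, b in zip(ra, rb))
    den = sum(b ** 2 for rb in B0 for b in rb)
    return math.sqrt(num / den)


def f_measure(mask, truth):
    """F-measure written as 2TP / (2TP + FP + FN); 0 if there is no true positive."""
    pairs = [(e, t) for re, rt in zip(mask, truth) for e, t in zip(re, rt)]
    tp = sum(1 for e, t in pairs if e and t)
    fp = sum(1 for e, t in pairs if e and not t)
    fn = sum(1 for e, t in pairs if t and not e)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def oracle_threshold(D, B_hat, S0):
    """Try each observed |D - B_hat| as a threshold; keep the best F-measure."""
    residuals = [[abs(d - b) for d, b in zip(rd, rb)] for rd, rb in zip(D, B_hat)]
    best_t, best_f = None, -1.0
    for t in sorted({r for row in residuals for r in row}):
        mask = [[1 if r >= t else 0 for r in row] for row in residuals]
        f = f_measure(mask, S0)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


rmse = relative_rmse([[3.0, 1.0], [0.0, 0.0]], [[3.0, 4.0], [0.0, 0.0]])
best_t, best_f = oracle_threshold(
    D=[[1.0, 5.0], [0.9, 1.1]],
    B_hat=[[1.0, 1.0], [1.0, 1.0]],
    S0=[[0, 1], [0, 0]],
)
```

Because the search uses the ground-truth support, the resulting F-measure is an optimistic score for the thresholded method, which is consistent with using it as a fair (favorable) setting for PCP.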

1) Comparison to PCP: Fig. 2 gives a qualitative comparison between PCP and DECOLOR. Fig. 2(c) presents the results of PCP. Notice the artifacts in $\hat{B}$ that spatially coincide with $S_0$, which shows that the $\ell_1$-penalty is not robust enough for relatively dense errors distributed in a contiguous region. Fig. 2(d) shows the results of DECOLOR. We see fewer false detections in the estimated $\hat{S}$ compared with PCP. Also, the recovered $\hat{B}$ is less corrupted by outliers.

For quantitative evaluation, we perform random experiments with different object widths $W$ and SNR. Fig. 3(a) reports the numerical results as functions of $W$. We can see that all methods achieve a high accuracy when $W = 10$, which means all of them work well when outliers are really sparse. As $W$ increases, the performance of PCP degrades significantly, while that of DECOLOR is much less affected. This demonstrates the robustness of DECOLOR. The result of DECOLOR with $\gamma = 0$ falls in between those of PCP and DECOLOR with $\gamma = \beta$, and it has a larger variance. This shows the importance of the contiguity prior. Moreover, DECOLOR gives a very stable performance for outlier detection (F-measure), while the accuracy of matrix recovery (inversely related to RMSE) drops noticeably as $W$ increases. The reason is that some background pixels are always occluded when the foreground is too large, such that they cannot be recovered even when the foreground is detected accurately.

Fig. 3(b) shows the results under different noise levels. DECOLOR maintains better performance than PCP if SNR is relatively high, but its performance drops dramatically when SNR < 2. This can be explained by the properties of non-convex penalties. Compared with the $\ell_1$-norm, non-convex penalties are more robust to gross errors [63] but more sensitive to entry-wise perturbations [54]. In general cases of natural video analysis, SNR is much larger than 1. Thus, DECOLOR can work stably.

2) Effects of parameters: Fig. 3(c) demonstrates the effects of parameters in Algorithm 1, i.e. $K$ and $\gamma$.

The parameter $K$ is the rough estimate of $\mathrm{rank}(B_0)$, which controls the complexity of the background model. Here, the true rank of $B_0$ is 3. From the top plot in Fig. 3(c), we can see that the optimal result is achieved at the turning point where $K = 3$. After that, the accuracy decreases very smoothly as $K$ increases. This insensitivity to $K$ is attributed to the shrinkage effect of the nuclear norm in (7), which plays an important role in preventing overfitting when estimating $B$. Specifically, given parameters $K$ and $\alpha$, the singular values of $\hat{B}$ are always shrunk by $\alpha$ due to the soft-thresholding operator in (10). Thus, our model overfits slowly when $K$ is larger than the true rank. Similar results can be found in [45].
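The shrinkage effect can be illustrated on a list of singular values alone. The Python sketch below applies the scalar soft-thresholding rule $\max(s - \alpha, 0)$ to each singular value; it does not compute an SVD, and the spectrum and the value of $\alpha$ are hypothetical numbers chosen only to mimic three strong background components followed by weak, noise-like ones.

```python
def soft_threshold(singular_values, alpha):
    """Shrink every singular value by alpha and clip at zero."""
    return [max(s - alpha, 0.0) for s in singular_values]


# Hypothetical spectrum: three dominant components plus small noise-like ones.
singular_values = [12.0, 9.0, 7.0, 0.8, 0.5, 0.3]
shrunk = soft_threshold(singular_values, alpha=1.0)
effective_rank = sum(1 for s in shrunk if s > 0)
```

Even if the model were allowed more than three components, the weak ones are driven to zero by the shrinkage, which gives some intuition for why the accuracy degrades only slowly when $K$ exceeds the true rank.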

The parameter $\gamma$ controls the strength of mutual interaction between neighboring pixels. From the bottom plot in Fig. 3(c), we can see that the performance keeps very stable when $\gamma \in [\beta, 10\beta]$.

Fig. 4. Simulation to illustrate inseparable cases of DECOLOR. (a) F-measure as a function of $d$, where $d$ is the number of frames within which the foreground stops moving. The true rank of $B_0$ is 3. (b) Fraction of trials of accurate foreground detection (F-measure > 0.95) over 200 trials, as a function of $\sigma_F$ and $W$. Here, $\sigma_F$ represents the standard deviation of foreground intensities and $W$ denotes the foreground width. $\sigma_B$ is the standard deviation of $B_0$.

3) Inseparable cases: In previous simulations, the foreground is always moving and the foreground entries are sampled from a uniform distribution with a relatively large variance. Under these conditions, DECOLOR performs effectively and stably for foreground detection (F-measure) unless the SNR is too low. Next, we study the cases in which DECOLOR cannot separate the foreground from the background correctly.

Firstly, we let the foreground not move for $d$ frames when generating the data. Fig. 4(a) shows the averaged F-measure as a function of $d$. Here, $\mathrm{rank}(B_0) = 3$. We can see that, with the default parameter $K = 7$, the accuracy of DECOLOR decreases dramatically as soon as $d > 0$. This is because DECOLOR overfits the static foreground into the background model, as the model dimension $K$ is larger than its actual value. When we decrease $K$ to 3, DECOLOR performs more stably until $d > 6$, which means that DECOLOR can tolerate temporary stopping of foreground motion. In short, when the object is not always moving, DECOLOR becomes more sensitive to $K$, and it cannot work when the object stops for a long time.

Next, to investigate the influence of foreground texture, we also run DECOLOR on random problems with outlier entries sampled from uniform distributions with random means and different variances $\sigma_F^2$. Fig. 4(b) displays the fraction of trials in which DECOLOR gives a high accuracy of foreground detection (F-measure > 0.95) over 200 trials, as a 2D function of $\sigma_F$ and $W$. The result of PCP is also shown for comparison. As we can see, DECOLOR achieves accurate detection with a high probability over a wide range of conditions, except for the upper left corner where $W$ is large and $\sigma_F$ is small, which represents the case of a large and textureless foreground. In practice, the interior motion of a textureless object is undetectable. Thus, its interior region remains unchanged for a relatively long time if the object is large or moving slowly. In this case, the interior part of the foreground may fit into the low-rank model, which makes DECOLOR fail.

B. Real Sequences 

We test DECOLOR on real sequences from public datasets for background subtraction, motion segmentation and dynamic texture detection. Please refer to Table I for the details of each sequence.

TABLE I

Information of the sequences used in experiments.

Fig.         Size × #frames     Ref.    Description
Fig. 6(a)    [160, 120] × 48    [42]    Crowded scene
Fig. 6(b)    [238, 158] × 24    [18]    Crowded scene
Fig. 6(c)    [160, 128] × 24    [64]    Crowded scene
Fig. 6(d)    [160, 128] × 48    [64]    Dynamic background
Fig. 6(e)    [160, 128] × 48    [64]    Dynamic background
Fig. 7(a)    [320, 240] × 40    [24]    Moving cameras
Fig. 7(b)    [320, 240] × 30    [24]    Moving cameras
Fig. 7(c)    [320, 240] × 30    [24]    Moving cameras
Fig. 7(d)    [320, 240] × 24    [24]    Moving cameras
Fig. 8       [180, 144] × 48    [20]    Dynamic foreground

1) Comparison to sparse signal recovery: As discussed in Section III-D2, a key difference between DECOLOR and sparse signal recovery is the assumption on the availability of training sequences. Background subtraction via sparse signal recovery requires a set of background images without foreground, which is not always available, especially for surveillance of crowded scenes. Fig. 5(a) gives such a sequence clipped from the start of an indoor surveillance video, where the couple is always in the scene.

Fig. 5. An example illustrating the difference between DECOLOR and sparse signal recovery. Panel (a) shows the data; panel (b) shows, from left to right, DECOLOR, ProxFlow and ProxFlow+. (a) The first and the last frames of a sequence of 24 images. Several people are walking and are continuously present in the scene. (b) The estimated background (top) and segmentation (bottom) corresponding to the last frame. ProxFlow means sparse signal recovery by solving (18) with the ProxFlow algorithm [41], where the first 23 frames are used as the basis matrix $\Phi$ in (18). ProxFlow+ means applying ProxFlow with bases $\Phi$ being the low-rank matrix $\hat{B}$ learnt by DECOLOR.

Fig. 5(b) shows the results of the 24th frame. For sparse signal recovery, we apply the ProxFlow algorithm^1 [41] to solve the model in (18). The previous 23 frames are used as the bases ($\Phi$ in (18)). Since the subspace spanned by previous frames also includes foreground objects, ProxFlow cannot recover the background and gives inaccurate segmentation. Instead, DECOLOR can estimate a clean background from occluded data. In practice, DECOLOR can be used for background initialization. For example, the last column in Fig. 5(b) shows the results of running ProxFlow with $\Phi$ being the low-rank $\hat{B}$ learnt by DECOLOR. That is, we use the background images recovered by DECOLOR as the training images for background subtraction. We can see that the results are clearly improved.

^1 The code is available at http://www.di.ens.fr/willow/SPAMS/

Fig. 6. Five sub-sequences of surveillance videos. Sequence information is given in Table I. The last frame of each sequence and its manual segmentation are shown in Column 1. The corresponding results by four methods are presented in the columns from Column 2 onwards, respectively. The top panel is the estimated background and the bottom panel is the segmentation.

2) Background estimation: In this part, we test DECOLOR on several real sequences selected from public datasets of background subtraction. Since we aim to evaluate the ability of algorithms in detecting moving objects at the start of videos, we focus on short clips composed of beginning frames of videos. All examples in Fig. 6 have only 24 or 48 frames, corresponding to 1 or 2 seconds for a frame rate of 24 fps. We compare DECOLOR with three methods that are simple in implementation but effective in practice. The first one is PCP [13], which is the state-of-the-art algorithm for RPCA. The second method is median filtration, a baseline method for unimodal background modeling. The median intensity value around each pixel is computed, forming a background image. Then, the background image is subtracted from each frame and the difference is thresholded to generate a foreground mask. The advantage of using the median rather than the mean is that it is a more robust estimator that avoids blending pixel values, which is more proper for background estimation [11]. The third method is mixture of Gaussians (MoG) [28]. It is popularly used for multimodal background modeling and has proven to be very competitive compared with other more sophisticated techniques for background subtraction [65].
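The median filtration baseline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the median is taken per pixel over time across the whole clip, each frame is a flat list of intensities, and the threshold value is a hypothetical choice; the function names are likewise hypothetical and not taken from any library.

```python
from statistics import median


def median_background(frames):
    """Per-pixel temporal median over a list of equally sized frames."""
    n_pixels = len(frames[0])
    return [median(frame[p] for frame in frames) for p in range(n_pixels)]


def foreground_mask(frame, background, threshold):
    """Label a pixel as foreground when it differs enough from the background."""
    return [1 if abs(v - b) > threshold else 0 for v, b in zip(frame, background)]


# Toy clip: a bright object (90) moves across a dark, static background (10).
frames = [
    [10, 10, 10, 10],
    [10, 90, 10, 10],
    [10, 10, 90, 10],
    [10, 10, 10, 90],
    [10, 10, 10, 10],
]
background = median_background(frames)
masks = [foreground_mask(f, background, threshold=30) for f in frames]
```

In this toy clip every pixel is covered by the object in at most one of five frames, so the median recovers the background exactly. If the object stayed on a pixel for more than half of the frames, the median at that pixel would take the object's intensity instead, which is the ghosting behavior discussed for this baseline.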

The sequences and results are presented in Fig. 6. The first example shows an office with two people walking around. Although the objects are large and always present in all frames, DECOLOR recovers the background and outputs a foreground mask accurately. Notice that the results are direct outputs of Algorithm 1 without any postprocessing. The results of PCP are relatively unsatisfactory. Ghosts of foreground remain in the recovered background. This is because the $\ell_1$-penalty used in PCP is not robust enough to remove the influence of contiguous occlusion. Such corruption of the extracted background results in false detections, as shown in the segmentation result. Moreover, without the smoothness constraint, occasional light changes (e.g. near the boundary of fluorescent lamps) or video noise give rise to small pieces of falsely detected regions. The results of median filtration depend on how long each pixel is occupied by foreground. Thus, from the recovered background of median filtration we can find that the man near the door is clearly removed while the man turning at the corner leaves a ghost. Despite scattered artifacts, MoG gives fewer false positives due to its multimodal modeling of background. However, blending of foreground intensity is clearly visible in the recovered background, which results in more false negatives in the foreground mask, e.g. in the interior region of objects. Similar results can be found in the next two examples.

The last two examples include dynamic background. Fig. 6(d) presents a sequence clipped from a surveillance video of an airport, which is very challenging because the background involves a running escalator. Although the escalator is moving, it is recognized as a part of the background by DECOLOR since its periodic motion gives repeated patterns. As we can see, the structure of the escalator is maintained in the background recovered by DECOLOR or PCP. This demonstrates the ability of low-rank representation to model dynamic background. Fig. 6(e) gives another example with a water surface as background. Similarly, the low-rank modeling of background gives better results with fewer false detections on the water surface, and DECOLOR obtains a cleaner background compared with PCP.

TABLE II

Quantitative evaluation (F-measure) on the sequences shown in Fig. 6.

Sequence     DECOLOR    PCP     Median    MoG
Fig. 6(a)    0.93       0.62    0.67      0.50
Fig. 6(b)    0.82       0.66    0.71      0.35
Fig. 6(c)    0.92       0.70    0.79      0.50
Fig. 6(d)    0.82       0.49    0.51      0.36
Fig. 6(e)    0.91       0.83    0.86      0.47

We also give a quantitative evaluation for the segmentation results shown in Fig. 6. The manual annotation is used as ground truth and the F-measure is calculated. As shown in Table II, DECOLOR outperforms other approaches on all sequences.

3) Moving cameras: Next, we demonstrate the potential of DECOLOR applied to motion segmentation problems using the Berkeley motion segmentation dataset^2. We use two people sequences and twelve car sequences, which are specialized for short-term analysis. Each sequence has several annotated frames as the ground truth for segmentation. Fig. 7 shows several examples and the results of DECOLOR. The transformed images $D \circ \hat{\tau}$ are shown in Column 2. Notice the extrapolated regions shown in black near the borders of these images. To minimize the influence of this numerical error, we constrain these pixels to be background when estimating $S$, but consider them as missing entries when estimating $B$. Fig. 7 demonstrates that DECOLOR can align the images, learn a background model and detect objects correctly.

For comparison, we also test the motion segmentation algorithm recently developed by Brox and Malik [24]. The Brox-Malik algorithm analyzes the point trajectories along the sequence and segments them into clusters. To obtain pixel-level segmentation, the variational method [26] can be applied to turn the trajectory clusters into dense regions. This additional step makes use of the color and edge information in images [26], while DECOLOR only uses the motion cue and directly generates the segmentation.

^2 http://lmb.informatik.uni-freiburg.de/resources/datasets/moseg.en.html

Fig. 7. Four sequences captured by moving cameras. Columns, from left to right: Image, Transformed, Low-rank, Segmentation, Brox-Malik, Truth. Sequence information is given in Table I. Only the last frame of each sequence and the corresponding results are shown. Columns 2-4 present the results of DECOLOR, i.e. the transformed image, the estimated background and the foreground mask. Column 5 shows the results given by Brox and Malik's algorithm for motion segmentation [24]. The last column shows the ground truth.

Quantitatively, we calculate the precision and recall of foreground detection, as shown in Table III. In summary, for most sequences with moderate camera motion, the performance of DECOLOR is competitive. On the people sequences, DECOLOR performs better. The feet of the lady are not detected by the Brox-Malik algorithm. The reason is that the Brox-Malik algorithm relies on correct motion tracking and clustering [26], which is difficult when the object is small and moving nonrigidly. Instead, DECOLOR avoids the complicated motion analysis. However, DECOLOR works poorly in cases where the background is a 3D scene with a large depth and the camera moves a lot, e.g. the sequences named cars9 and cars10. This is because the parametric motion model used in DECOLOR can only compensate for planar background motion.

TABLE III

Quantitative evaluation using the sequences from the Berkeley motion segmentation dataset [24]. The overall result is the median value over all people and car sequences.

             DECOLOR                 Brox-Malik [24]
Sequence     Precision    Recall     Precision    Recall
Fig. 7(a)    93.6%        93.3%      89.0%        77.5%
Fig. 7(b)    92.5%        96.5%      91.7%        89.2%
Fig. 7(c)    83.7%        98.4%      82.4%        99.4%
Fig. 7(d)    72.0%        98.0%      76.4%        99.8%
Overall      81.8%        90.8%      80.8%        99.2%

Fig. 8. An example of smoke detection. (a) Sample frame. (b) Estimated background. (c) Segmentation.

4) Dynamic foreground: Dynamic texture segmentation has drawn some attention in recent computer vision research [20], [18]. While we have shown that DECOLOR can model periodically varying textures like escalators or water surfaces as background, it is also able to detect fast changing textures, whose motion has little periodicity and cannot be modeled as low-rank. Fig. 8 shows such an example, where the smoke is detected as foreground. Here, the background behind the smoke cannot be recovered since it is always occluded.
5) Computational cost: Our algorithm is implemented in MATLAB. All experiments are run on a desktop PC with a 3.4 GHz Intel i7 CPU and 3 GB RAM. Since the graph cut is operated for each frame separately as discussed in Section III-C2, the dominant cost comes from the computation of SVD in each iteration. The CPU times of DECOLOR for the sequences in Fig. 6 are 26.2, 13.3, 14.1, 11.4 and 14.4 seconds, while those of PCP are 26.8, 38.0, 15.7, 39.1, and
21.9 seconds, respectively. All results are obtained with a convergence precision of 10"^. The 
memory costs of DECOLOR and PCP are almost the same, since both of them need to compute the SVD. The peak values of memory used in DECOLOR for the sequences in Fig. 6(a) and Fig. 7(b) are around 65 MB and 210 MB, respectively.

VI. Discussion

In this paper, we propose a novel framework named DECOLOR to segment moving objects 
from image sequences. It avoids complicated motion computation by formulating the problem 
as outlier detection and makes use of the low-rank modeling to deal with complex background. 

We established the link between DECOLOR and PCP. Compared with PCP, DECOLOR uses the non-convex penalty and MRFs for outlier detection, which is more greedy to detect outlier regions that are relatively dense and contiguous. Despite its satisfactory performance in our experiments, DECOLOR also has some disadvantages. Since DECOLOR minimizes a non-convex energy via alternating optimization, it converges to a local optimum with results depending on the initialization of $\hat{S}$, while PCP always minimizes its energy globally. In all our experiments, we simply start from $\hat{S} = 0$. Also, we have tested other random initializations of $\hat{S}$, and the algorithm generally converges to a satisfactory result. This is because the SOFT-IMPUTE step will output similar results for each randomly generated $S$ as long as $S$ is not too dense.

As illustrated in Section V-A3, DECOLOR may misclassify unmoved objects or large textureless regions as background, since they are prone to entering the low-rank model. To address these problems, incorporating additional models such as object appearance or shape priors to improve the power of DECOLOR can be further explored in the future.

Currently, DECOLOR works in a batch mode. Thus, it is not suitable for real-time object detection. In the future, we plan to develop an online version of DECOLOR that can work incrementally, e.g. the low-rank model extracted from beginning frames may be updated online when new frames arrive.
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